Assignment 0

Data 570: Predictive Modelling

Author

RUOCHENYANG - 51255735

Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

Running Code

When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:

1 + 1
[1] 2

You can add options to executable code like this:

[1] 4

The echo: false option disables the printing of code (only output is displayed).
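As a minimal sketch of that option, the code below shows how such a cell might be written. Inside a Quarto `{r}` code cell, a leading line that starts with `#|` sets a cell option. The expression `2 * 2` and the name `result` are assumptions, chosen only because they reproduce the `[1] 4` output shown above.

```r
#| echo: false
# The `#|` line above is a Quarto cell option: the rendered document
# shows this cell's output but hides its source code.
# Run as plain R, that line is simply an ordinary comment.
result <- 2 * 2
result
```

Setting `echo: true` instead, or removing the option line, would make the rendered document print the code together with its output.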

Introduction

Hello! My name is RUOCHENYANG, and I am a first-year master’s student in Data Science at UBC.

This picture was taken at Granville Island in Vancouver.

install.packages("tidyverse", repos = "https://cran.rstudio.com/")
Installing package into 'C:/Users/magic/AppData/Local/R/win-library/4.5'
(as 'lib' is unspecified)
package 'tidyverse' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\magic\AppData\Local\Temp\Rtmp4wNzSl\downloaded_packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read CSV file (adjust the path if needed)
sales <- read_csv("sales_data.csv")
Rows: 30 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): Order ID, Product Name, Category, Customer ID, Customer Gender, Pa...
dbl  (4): Price, Quantity Sold, Total Sales, Customer Age
date (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preview the first few rows
head(sales)
# A tibble: 6 × 12
  Date       `Order ID` `Product Name` Category    Price `Quantity Sold`
  <date>     <chr>      <chr>          <chr>       <dbl>           <dbl>
1 2023-01-01 ORD1001    Smartphone     Mobile      300.                1
2 2023-01-02 ORD1002    Laptop         Computers   900.                2
3 2023-01-03 ORD1003    Headphones     Accessories  50.0               3
4 2023-01-04 ORD1004    Tablet         Mobile      200.                1
5 2023-01-05 ORD1005    Smartphone     Mobile      300.                2
6 2023-01-06 ORD1006    Smartwatch     Accessories 150.                1
# ℹ 6 more variables: `Total Sales` <dbl>, `Customer ID` <chr>,
#   `Customer Age` <dbl>, `Customer Gender` <chr>, `Payment Method` <chr>,
#   `Store Location` <chr>
# Display structure of the dataset
str(sales)
spc_tbl_ [30 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Date           : Date[1:30], format: "2023-01-01" "2023-01-02" ...
 $ Order ID       : chr [1:30] "ORD1001" "ORD1002" "ORD1003" "ORD1004" ...
 $ Product Name   : chr [1:30] "Smartphone" "Laptop" "Headphones" "Tablet" ...
 $ Category       : chr [1:30] "Mobile" "Computers" "Accessories" "Mobile" ...
 $ Price          : num [1:30] 300 900 50 200 300 ...
 $ Quantity Sold  : num [1:30] 1 2 3 1 2 1 1 2 1 2 ...
 $ Total Sales    : num [1:30] 300 1800 150 200 600 ...
 $ Customer ID    : chr [1:30] "CUST500" "CUST501" "CUST502" "CUST503" ...
 $ Customer Age   : num [1:30] 34 29 42 38 25 31 27 40 35 33 ...
 $ Customer Gender: chr [1:30] "Female" "Male" "Non-binary" "Female" ...
 $ Payment Method : chr [1:30] "Credit Card" "Cash" "PayPal" "Credit Card" ...
 $ Store Location : chr [1:30] "New York" "Los Angeles" "Chicago" "New York" ...
 - attr(*, "spec")=
  .. cols(
  ..   Date = col_date(format = ""),
  ..   `Order ID` = col_character(),
  ..   `Product Name` = col_character(),
  ..   Category = col_character(),
  ..   Price = col_double(),
  ..   `Quantity Sold` = col_double(),
  ..   `Total Sales` = col_double(),
  ..   `Customer ID` = col_character(),
  ..   `Customer Age` = col_double(),
  ..   `Customer Gender` = col_character(),
  ..   `Payment Method` = col_character(),
  ..   `Store Location` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
# Remove duplicate rows
sales <- distinct(sales)

# Check for missing values
colSums(is.na(sales))
           Date        Order ID    Product Name        Category           Price 
              0               0               0               0               0 
  Quantity Sold     Total Sales     Customer ID    Customer Age Customer Gender 
              0               0               0               0               0 
 Payment Method  Store Location 
              0               0 
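Because `sales` has no duplicates or missing values, the two checks above print only zeros. The sketch below shows what they would report on data that does have problems. The data frame `toy` and its values are hypothetical, and base R's `unique()` is used here as a stand-in for `distinct()` so that the example runs without any packages.

```r
# Hypothetical toy data: rows 2 and 3 are exact duplicates,
# and the last row has a missing price.
toy <- data.frame(
  id    = c("A", "B", "B", "C"),
  price = c(10, 20, 20, NA)
)

# Drop exact duplicate rows (base-R analogue of dplyr::distinct()).
deduped <- unique(toy)

# Count missing values per column: is.na() gives a logical matrix,
# and colSums() adds up the TRUE values in each column.
na_counts <- colSums(is.na(deduped))

nrow(deduped)
na_counts
```

A non-zero entry in `na_counts` points to the column that would need imputation or filtering before any further analysis.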
sales_ny <- filter(sales, `Store Location` == "New York")
head(sales_ny)
# A tibble: 6 × 12
  Date       `Order ID` `Product Name` Category    Price `Quantity Sold`
  <date>     <chr>      <chr>          <chr>       <dbl>           <dbl>
1 2023-01-01 ORD1001    Smartphone     Mobile      300.                1
2 2023-01-04 ORD1004    Tablet         Mobile      200.                1
3 2023-01-07 ORD1007    Laptop         Computers   800.                1
4 2023-01-10 ORD1010    Headphones     Accessories  60.0               2
5 2023-01-12 ORD1012    Laptop         Computers   950.                1
6 2023-01-15 ORD1015    Headphones     Accessories  90.0               1
# ℹ 6 more variables: `Total Sales` <dbl>, `Customer ID` <chr>,
#   `Customer Age` <dbl>, `Customer Gender` <chr>, `Payment Method` <chr>,
#   `Store Location` <chr>
sales_ny %>%
  group_by(Date) %>%
  summarise(total = sum(`Total Sales`, na.rm = TRUE)) %>%
  arrange(desc(total)) %>%
  slice(1)
# A tibble: 1 × 2
  Date       total
  <date>     <dbl>
1 2023-01-27 2400.
sales %>%
  count(`Payment Method`, sort = TRUE)
# A tibble: 3 × 2
  `Payment Method`     n
  <chr>            <int>
1 Credit Card         13
2 Cash                 9
3 PayPal               8

Visualization

ggplot(sales, aes(x = `Customer Age`)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(x = "Customer Age", y = "Count") +
  theme_minimal()

ggplot(sales, aes(x = `Quantity Sold`, y = Price)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Relationship between Quantity and Price",
    x = "Quantity Sold",
    y = "Price"
  ) +
  theme_minimal()

Conclusion

As shown in the scatter plot above, "Relationship between Quantity and Price", there appears to be a negative relationship between quantity sold and price: as the quantity increases, the price tends to decrease slightly.
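A visual impression like this can be checked numerically with a correlation coefficient. The sketch below is illustrative only: it uses just the six rows printed by `head(sales)` earlier in the report, which are far too few to confirm or refute the pattern, and the names `preview`, `r_pearson`, and `r_spearman` are assumptions introduced for the example.

```r
# Only the six rows shown by head(sales); the full dataset has 30 rows.
preview <- data.frame(
  quantity_sold = c(1, 2, 3, 1, 2, 1),
  price         = c(300, 900, 50, 200, 300, 150)
)

# Pearson measures linear association; Spearman uses ranks and is
# less sensitive to a single extreme price.
r_pearson  <- cor(preview$quantity_sold, preview$price, method = "pearson")
r_spearman <- cor(preview$quantity_sold, preview$price, method = "spearman")

r_pearson
r_spearman

# On the full data, the same check would be:
# cor(sales$`Quantity Sold`, sales$Price)
```

A coefficient close to zero would suggest that the apparent trend in the plot is weak, while a clearly negative value would support the conclusion.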